Abstract Human coronavirus 229E was identified in the
mid-1960s, yet only one full-genome sequence is available
to date. This full-length sequence was determined from the
cDNA clone Inf-1, which is based on the lab-adapted strain
VR-740. Lab adaptation might have resulted in genomic
changes, owing to insufficient pressure to maintain the
integrity of non-essential genes. We present here the first
full-length genome sequences of two clinical isolates. Each
encoded gene was compared to Inf-1. In general, few sequence
changes were noted, and most could be attributed to genetic
drift, since the clinical isolates originate from 2009 to
2010 and VR-740 from 1962. Hot spots of substitutions were
situated in the S1 region of the spike gene, the nucleocapsid
gene, and the non-structural protein 3 gene, whereas several
deletions were detected in the 3'UTR. Most notable was the
difference in genome organization: instead of an ORF4A and
ORF4B, an intact ORF4 was present in the clinical isolates.

Keywords Human coronavirus 229E • Respiratory tract
infection • Complete genome sequence

Introduction

Coronaviruses are a large group of viruses that infect many
animal species, including mammals and birds. Coronaviruses
are enveloped, positive-strand RNA viruses and belong to the
Coronaviridae family. The genomes are linear, non-segmented,
and single-stranded. At 27-31.5 kb in length, coronavirus
genomes are the largest of the known RNA viruses. The genomes
are polycistronic, generating a nested set of subgenomic RNAs
with common 5' and 3' sequences [1]. Based on serologic and
phylogenetic relationships, coronaviruses are classified into
three genera. Alphacoronavirus and Betacoronavirus consist of
various mammalian coronaviruses, whereas Gammacoronavirus
includes bird viruses [2]. The genus Alphacoronavirus
includes transmissible gastroenteritis virus (also referred to
as Alphacoronavirus 1; ICTV 2009), porcine epidemic diarrhea
virus, some bat coronaviruses, and the human coronaviruses
(HCoVs) NL63 and 229E. In general, HCoV-229E causes the
common cold, but occasionally it is associated with more
severe respiratory infections in children, the elderly, and
persons with underlying illness [3-5].
So far, only one full genome has been determined for
HCoV-229E [6]. This reference sequence was obtained from the
infectious HCoV-229E cDNA clone (Inf-1), which is based on
the laboratory-adapted prototype strain of HCoV-229E
(VR-740), deposited in 1973. The prototype strain was
originally isolated in 1962 from a medical student with an
upper respiratory infection at the University of Chicago
[7, 8]. Only limited sequence data have been obtained from
current HCoV-229E isolates. Chibo and Birch [5] investigated
the evolution of HCoV-229E by sequencing part of the S and
the N gene from clinical samples collected between 1979 and
2004.

Sequence data from other genomic regions are still lacking,
as no full-genome sequence of a non-lab-adapted virus is
available.

Materials and methods 

Clinical samples 

We propagated several contemporary strains of HCoV-229E on
pseudostratified human airway epithelial cell cultures. One
of these, clinical strain HCoV-229E 0349 (21050349), was
isolated from a respiratory swab collected in the Netherlands
in 2010 from a stem cell transplantation recipient who
presented in hospital with fever and respiratory infection.
The virus was propagated on human airway epithelial cells as
described previously [9]. The apical supernatant was
harvested 72 h post-infection by apical washing. A second
clinical isolate, J0304, was not cultured; it was obtained
from an adult with symptoms of lower respiratory tract
infection in Italy and collected via the GRACE European
Network of Excellence [10]. Ethics review committees in each
country approved the study, and written informed consent was
provided by all study participants.

Full-genome sequencing of the HCoV-229E clinical 
isolates 

Total RNA was extracted from the apical washing of 0349 and
from the Copan-collected swab of J0304 (Copan Diagnostics) as
described [11]. Reverse transcription was performed at 37 °C
for 1 h using random hexamers and SuperScript II
(Invitrogen). The HCoV-229E Inf-1 reference sequence
(accession number NC_002645.1) was used as a scaffold for
designing bidirectional PCR primer combinations, amplifying
fragments with an average length of 500 bp and a minimum
overlap of 80 bp with adjacent primer combinations. Primer
sequences are available upon request. Amplification of the
fragments was performed with the following thermal cycle
profile: 5 min at 95 °C; 45 cycles of 95 °C for 1 min, 55 °C
for 1 min, and 72 °C for 2 min; followed by a final
elongation step of 7 min at 72 °C. PCR fragments were
visualized by agarose gel electrophoresis with ethidium
bromide staining. Positive PCR fragments were sequenced
directly in both directions with their forward and reverse
primers. Sequencing reactions were performed according to the
BigDye Terminator v1.1 protocol (ABI Life Science). Sequences
were analyzed with CodonCode Aligner software (version
3.7.1). Sequences have been submitted to GenBank (JX503060,
JX503061).
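The tiling scheme described above (fragments of about 500 bp with at least 80 bp of overlap) can be illustrated with a minimal sketch. The function name `tile_amplicons`, the fixed window size, and the genome length of 27,317 nt used in the example (the length assumed here for NC_002645.1) are illustrative assumptions; the primer positions actually used were chosen individually and are not reproduced here.

```python
def tile_amplicons(genome_length, amplicon_length=500, min_overlap=80):
    """Return 1-based inclusive (start, end) windows that tile a genome.

    Hypothetical helper: consecutive windows overlap by exactly
    `min_overlap` nt, and the last window is truncated at the 3' end.
    """
    if not 0 <= min_overlap < amplicon_length:
        raise ValueError("overlap must be smaller than the amplicon length")
    step = amplicon_length - min_overlap
    windows = []
    start = 1
    while True:
        end = min(start + amplicon_length - 1, genome_length)
        windows.append((start, end))
        if end == genome_length:
            break
        start += step
    return windows


# Assumed reference length; replace with the length of the scaffold in use.
windows = tile_amplicons(27317)
print(len(windows), windows[0], windows[-1])
```

Real primer design would additionally shift each window to sites with suitable melting temperature and low sequence variability, which this sketch does not attempt.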

5' and 3' RACE 

In order to complete the full-genome sequence with the 5' and
3' termini, 5' and 3' RACE were performed. The 5' end was
determined with the 5' RACE kit (Invitrogen) according to the
manufacturer's protocol. Gene-specific primers for 5' RACE
PCR amplification were designed to flank approximately 100 nt
of the 5' region. The 3' end of the HCoV-229E clinical strain
was determined with 3' RACE, with an RT reaction performed
using the Oligo-dT-JZH primer and PCR amplification with the
JZH primer and a gene-specific primer [9]. The PCR products
were excised after agarose electrophoresis and purified with
the NucleoSpin Extract II kit (Macherey-Nagel) according to
the manufacturer's protocol. Purified PCR products were
cloned into the pCRII-TOPO TA vector (Invitrogen) and
transformed into chemically competent E. coli (TOP10 cells,
Invitrogen) according to the manufacturer's protocol.
Transformants were directly analyzed via colony PCR with T7
and M13Rev primers. PCR products were sequenced as described
above.

Full-genome sequence analysis 

The ZCURVE_CoV 1.0 program was used to recognize and predict
putative protein-coding genes [12, 13]. Phylogenetic analyses
(neighbor-joining method) were conducted using MEGA, version
4.02. The identity between the HCoV-229E clinical isolates
and the reference sequence of 229E (Inf-1, NC_002645.1) was
investigated by pairwise alignment using BioEdit Sequence
Aligner; identity comparisons per gene were made in the same
way. SimPlot (version 3.5.1) was used to draw
similarity/distance plots. N- and O-linked glycosylation
sites and signal peptide cleavage sites were predicted using
the NetNGlyc 1.0, NetOGlyc 3.1, and SignalP 4.0 analysis
tools from the Center for Biological Sequence Analysis
(http://www.cbs.dtu.dk/services/).
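For readers who wish to reproduce identity values from an existing pairwise alignment, a minimal sketch is given below. It is not the BioEdit implementation: the function name `percent_identity` is hypothetical, and the convention that a gap opposite a residue counts as a mismatch is an assumption that may differ from the convention used by a given alignment program.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity between two pre-aligned sequences of equal length.

    Assumed convention: columns where both sequences have a gap are
    ignored; a gap opposite a residue counts as a mismatch.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == gap and y == gap:
            continue
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return 100.0 * matches / compared


print(percent_identity("ACGT-A", "ACGTTA"))
```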

Results and discussion 

The full genomes of HCoV-229E strains 0349 and J0304 consist
of approximately 27,240 nt. The GC content is 38.07 % in 0349
and 38.13 % in J0304, compared with 38.26 % for the reference
sequence of the laboratory-adapted virus (Inf-1,
NC_002645.1). Little sequence difference was noted between
the two clinical isolates: only 135 nucleotide differences,
one codon insertion/deletion in the S gene, and a 2 nt
insertion/deletion in the 3'UTR. According to the ZCURVE_CoV
1.0 program, the genome organization of 0349 and J0304 is
similar to that of the reference sequence, with one major
difference. The clinical strains of HCoV-229E have seven
putative protein-coding genes, whereas there are eight in the
reference sequence. This difference arises from the fact that
the clinical strains have an intact ORF4, whereas the
reference has ORF4A and ORF4B. Table 1 shows the nucleotide
and amino acid similarities among the ORFs of HCoV-229E
strains 0349 and J0304 and the reference sequence. The
greatest distances (>2 % at the nt level) are observed in the
non-structural protein 3 (NSP3) gene, the spike gene, the
nucleocapsid gene, and the 3'UTR.
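The GC percentages quoted above are base-composition ratios and can be recomputed from any genome sequence with a few lines of code. The sketch below is illustrative only; the function name `gc_content` and the decision to ignore ambiguous bases such as N are assumptions, not a description of the software used in this study.

```python
def gc_content(sequence):
    """Return the GC percentage of a nucleotide sequence.

    Assumption: only unambiguous bases (A, C, G, T, U) are counted,
    so ambiguity codes such as N do not affect the result.
    """
    counted = [base for base in sequence.upper() if base in "ACGTU"]
    if not counted:
        raise ValueError("sequence contains no unambiguous bases")
    gc = sum(1 for base in counted if base in "GC")
    return 100.0 * gc / len(counted)


print(gc_content("GGCCAATT"))
```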

Full-genome alignment of 0349 and J0304 with the reference
sequence reveals 168 and 175 substitutions, respectively, in
the 1a replicase gene, of which 66 are non-synonymous in
both. In the 1b replicase gene, there are 84 substitutions in
0349 and 81 in J0304, resulting in 20 and 19 amino acid
changes, respectively. Furthermore, one deletion and one
insertion were observed. The NSP3 gene encodes the largest
non-structural protein, which comprises 1,594 amino acid
residues. This multi-functional protein acts as papain-like
protease (PLpro) and also has catalytic activity [14]. In
0349 and J0304, we observed 74 nt substitutions (34
non-synonymous in 0349 and 33 in J0304), a 21 nt insertion at
position 286, and a 6 nt deletion at positions 316-321. The
NSP4 gene is situated between the genes encoding the two
autoproteolytic proteins NSP3 and NSP5. There are 21 nt
substitutions in 0349 that cause five amino acid changes. In
J0304, there are 19 substitutions, 4 of which are
non-synonymous. NSP5 has a proteolytic role with cysteine
protease activity. It is very important in viral replication
and is, therefore, often referred to as the main protease
(Mpro) [14, 15]. The Mpro-mediated processing pathways are
well conserved in all coronaviruses, and Mpro cleaves as many
as 11 pp1a/pp1b sites to produce a total of 13 mature
proteins [16]. Although there are 7 changes at the nucleotide
level in 0349, there is only one amino acid difference
(E222D). The same holds for J0304, with 10 substitutions that
cause only two changes at the amino acid level (F12F and
E222D). The NSP6 gene encodes a membrane-spanning protein
[14]. Eight nucleotide substitutions are found in 0349 and 7
in J0304, of which one is non-synonymous (V86I). The
HCoV-229E genome encodes several small non-structural
proteins, NSP7 to NSP10, that have RNA-binding activity and
are believed to be involved in viral RNA synthesis [14].
There is one substitution in the NSP7 gene in 0349 and 3 in
J0304 that have no effect on the encoded protein. In NSP8,
there are 6 nt substitutions; all except one (I179T) are
synonymous. There are three nucleotide substitutions in NSP9
of 0349 and five in J0304, one of them non-synonymous (T23I).
Within the NSP10 gene, there are 6 nt substitutions in 0349,
including two that change the amino acid sequence (S18A and
C57S). Although there are 8 nt substitutions in J0304, only
one affects the amino acid sequence (C57S).

Table 1 Nucleotide and amino acid identity among the ORFs of
clinical isolates and Inf-1 HCoV-229E

        % Amino acid identity   % Nucleotide identity   % Amino acid similarity
        0349      J0304         0349      J0304         0349      J0304
ORF1a   98.0      98.1          98.3      98.3          98.7      98.8
ORF1b   99.2      99.2          98.9      98.0          99.5      99.6
NSP1    99.1      99.1          99.4      99.4          99.1      99.1
NSP2    97.6      97.4          98.3      98.3          98.5      98.7
NSP3    97.2      97.2          97.8      97.8          98.0      98.1
NSP4    99.0      99.0          98.5      98.7          99.8      99.6
NSP5    99.7      99.3          99.2      98.9          100       99.7
NSP6    99.6      99.6          99.0      99.2          100       100
NSP7    100       100           99.6      98.8          100       100
NSP8    99.5      99.5          99.0      99.0          99.5      99.5
NSP9    99.1      99.1          99.1      98.5          99.1      99.1
NSP10   98.5      99.3          98.5      98.0          99.3      99.3
NSP12   98.7      98.8          98.7      98.8          99.3      99.5
NSP13   99.5      99.3          98.9      98.9          99.7      99.5
NSP14   99.8      99.7          98.9      98.9          99.4      100
NSP15   99.4      99.4          99.3      99.3          99.7      99.7
NSP16   99.7      99.7          99.2      99.1          100       100
S       94.8      94.7          96.4      96.4          96.8      96.8
E       98.7      98.7          99.1      99.6          98.7      98.7
M       99.5      99.5          98.5      98.5          99.1      99.1
N       98.2      98.5          97.8      97.8          98.5      98.4
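The distinction between synonymous and non-synonymous substitutions used throughout this section can be made explicit with a short sketch. It assumes two in-frame coding sequences of equal length without insertions or deletions and uses the standard genetic code; the function name `classify_substitutions` is hypothetical, and changes are reported per codon, so a codon carrying two nucleotide changes is listed once.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def classify_substitutions(ref_cds, query_cds):
    """List codon-level differences between two aligned coding sequences.

    Returns tuples (codon_number, change, nt_changes, kind), where
    `change` uses the notation of the text (for example E222D).
    """
    ref = ref_cds.upper().replace("U", "T")
    query = query_cds.upper().replace("U", "T")
    if len(ref) != len(query) or len(ref) % 3 != 0:
        raise ValueError("sequences must be in frame and of equal length")
    changes = []
    for i in range(0, len(ref), 3):
        ref_codon, query_codon = ref[i:i + 3], query[i:i + 3]
        if ref_codon == query_codon:
            continue
        number = i // 3 + 1
        ref_aa = CODON_TABLE[ref_codon]
        query_aa = CODON_TABLE[query_codon]
        nt_changes = sum(a != b for a, b in zip(ref_codon, query_codon))
        kind = "synonymous" if ref_aa == query_aa else "non-synonymous"
        changes.append((number, f"{ref_aa}{number}{query_aa}", nt_changes, kind))
    return changes


# Toy example: GAA -> GAT changes E to D; TTT -> TTC leaves F unchanged.
print(classify_substitutions("ATGGAATTT", "ATGGATTTC"))
```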

NSP12, the RNA-dependent RNA polymerase, consists of 927
amino acids. There are 35 nt substitutions in 0349 and 34 in
J0304, which result in 12 and 11 amino acid changes,
respectively, in comparison with the reference sequence (N4S,
A32V, K131R, E134G, S147N, S233A, M458I, S524L, S757G, G767E,
I842V, H906Q). Moreover, there is a difference in the
cleavage site between NSP10 and NSP12. In the reference, the
cleavage site is TAIQ/SFDN, whereas it is TAIQ/SFDS in both
clinical strains.

NSP13 encodes the viral helicase. We noticed 19 nt
substitutions in 0349 and 20 in J0304, only three of which
cause changes at the amino acid level (N58T, N352T, and
I591V). In J0304, there is one more non-synonymous
substitution (I475T). The same pattern is observed in NSP14
of 0349, the gene encoding the exoribonuclease: 17 nt
substitutions and only three changes in the amino acid
sequence (T254N, D372E, and E514D); in J0304, there are 13
substitutions that cause two amino acid changes (D372E and
E514D). In NSP15, there are seven substitutions, of which two
are non-synonymous (P187H and V307L), and in NSP16 there are
7 nt substitutions in 0349 and 8 in J0304, of which one is
non-synonymous (E270D).

The S gene encodes two spike protein domains, S1 and S2
(codons 1-560 and 561-1,173, respectively). Similar to the
reference strain, no furin cleavage site is present between
the S1 and S2 domains. Alignment of the S genes identified
two deletions in both clinical strains and 117 nt
substitutions, of which 56 are non-synonymous in 0349 and 57
in J0304.

Most amino acid changes (including 48 amino acid
substitutions in 0349 and 49 in J0304) and the deletions are
within the S1 domain, especially at positions 223-229,
307-324, 349-358, and 401-411. The deletions in 0349 result
in two codon deletions in S1 at amino acid positions 228 and
354. The deletions in J0304 result in three codon deletions
in S1 at amino acid positions 228, 354, and 355. Forty-four
of the non-synonymous changes and the two codon deletions
have been described before by Chibo and Birch [5], suggesting
that they are shaped by evolution and positive selection
pressure on S1. Amino acid changes in S1 allow escape from
virus-neutralizing antibodies [5].


The receptor of HCoV-229E is human aminopeptidase N, which is
recognized by the S1 region. Studies show that the region
between amino acids 417 and 517 is important for binding to
the receptor [17]. In this region, our clinical isolates have
four synonymous and five non-synonymous substitutions. One
study indicated that the area between amino acids 278 and 329
is also important for binding aminopeptidase N [18]. In this
region, the clinical isolates have 10 non-synonymous
substitutions. All of the 15 amino acid changes that might
affect receptor binding have been described previously [5].
Despite the amino acid changes in the receptor-binding
regions, we have no indications that the clinical isolate
0349 differs in cell tropism from the Inf-1 isolate. Both
strains have identical cell tropism in pseudostratified
respiratory epithelium cultures (R. Dijkman et al.,
manuscript in preparation).

The putative S proteins of 0349 and J0304 contain 27 and 26
potential N-glycosylation sites, respectively, upon analysis
with the NetNGlyc 1.0 analysis tool from the Center for
Biological Sequence Analysis
(http://www.cbs.dtu.dk/services/), whereas the reference
sequence contains 24 potential N-glycosylation sites. These
24 are conserved, and there are three extra sites at
positions 20, 111, and 488 in isolate 0349 and two extra at
positions 111 and 488 in J0304. The predicted signal peptide
of S in the reference sequence is present at amino acids
1-16, using the SignalP 4.0 tool from the Center for
Biological Sequence Analysis. The predicted signal peptide of
both clinical isolates is also located at amino acids 1-16,
with a potential cleavage site between amino acids 16 and 17.
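Potential N-glycosylation sites are built on the sequon N-X-S/T, where X is any residue except proline. The sketch below lists such sequons in a protein sequence. It is a simplification and not a substitute for NetNGlyc, which scores each sequon in its sequence context; a plain motif scan will generally report at least as many sites. The function name `find_sequons` and the example peptide are hypothetical.

```python
import re

# Lookahead so that overlapping sequons (for example N-N-T-S) are all found.
SEQUON = re.compile(r"(?=N[^P][ST])")


def find_sequons(protein):
    """Return 1-based positions of the Asn in each N-X-S/T sequon (X not Pro)."""
    return [match.start() + 1 for match in SEQUON.finditer(protein.upper())]


# Toy peptide, not taken from the spike protein.
print(find_sequons("MNGSANPTNNTS"))
```

Comparing the position lists obtained for two aligned spike sequences would show which sequons are gained or lost, in the same way as the extra sites at positions 20, 111, and 488 are reported above.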

The S2 part contains the heptad repeats (HR1 at codons
777-916; HR2 at codons 1,057-1,105) [19], the transmembrane
domain (codons 1,117-1,138), and a cytoplasmic tail. There
are only eight amino acid substitutions, one of which is
located in the HR1 region (T871I). All have been described
previously by Chibo and Birch [5]. None of the amino acid
changes is located in the transmembrane domain or the
cytoplasmic tail.

A phylogenetic analysis that includes our clinical isolates,
the reference sequence, and various S sequences from clinical
samples collected between 1979 and 2004 provides further
evidence of divergence over time, as shown previously
(Fig. 1a) [5]. Chibo and Birch presented four
phylogenetically distinct groups of S gene sequences whose
clustering matched the year of isolation, indicating genetic
drift over time. Isolates 0349 and J0304 cluster with the
group 4 viruses, a group that contains all GenBank S
sequences collected from 1999 onwards.

The ORF4 gene is located between the spike and envelope
genes. Its function is unknown, yet studies with the
HCoV-NL63 accessory protein ORF3, a homolog of 229E ORF4,
revealed that the protein is incorporated into virions and
is, therefore, an additional structural protein [13]. There
are major differences between the clinical isolates and the
reference sequence. Most important is a 2 nt deletion in the
reference strain that interrupts the gene, whereas the two
clinical isolates have an intact ORF4, as previously
published [20]. The putative full-length protein is 219 amino
acids in size. Besides the insertion/deletion, 17 nt changes
are observed (identical in both clinical isolates), resulting
in nine amino acid substitutions.

Fig. 1 Phylogenetic analysis of the (a) spike and (b)
nucleocapsid genes of the HCoV-229E clinical isolates, Inf-1,
and those available in GenBank (accession numbers, year, and
country of collection are indicated)

Table 2 The leader and body TRSs in HCoV-229E clinical
isolates (0349, J0304) and laboratory-adapted isolate (Inf-1).
N220, N38, and N152 denote that number of intervening
nucleotides.

         TRS core structure        Position Inf-1   Position 0349    Position J0304
Leader   UCUCAACUAAA-N220-AUG      62-295           62-295           62-295
S        UCUCAACUAAAUAAAAUG        20,571-20,588    20,570-20,587    20,570-20,587
ORF4     UCAACUAAA-N38-AUG         24,054-24,103    24,053-24,102    24,050-24,099
E        UCUCAACUAA-N152-AUG       24,599-24,764    24,599-24,763    24,596-24,760
M        UCUAAACUAAACGACAAUG       24,991-25,009    24,990-25,008    24,987-25,005
N        UCUAAACUGAACGAAAAGAUG     25,680-25,700    25,679-25,699    25,676-25,696

The envelope gene encodes a 77 amino acid protein, and the
clinical isolates differ from the reference at only 2 nt, one
of which results in an amino acid change (V12I).

The membrane gene consists of 678 nucleotides, 
encoding a 225 amino acid protein. Both clinical isolates 
have 10 nt differences with the reference sequence, of 
which only one is non-synonymous (F82L). 

The nucleocapsid protein plays a fundamental role in virus
assembly and RNA synthesis [21]. Among the 1,170 nt that
encode the N protein, 26 nt substitutions with seven changes
in the amino acid sequence were noted in 0349, and 25 nt
substitutions with eight amino acid changes in J0304. Three
of these changes are located in a hot spot between amino
acids 224 and 228. Inspection of the N-gene sequences
available in GenBank revealed that six of the seven amino
acid changes have been present in circulating HCoV-229E
strains for decades. This includes the 224-228 hot spot,
which was already present in 1982 [5]. Phylogenetic analysis
shows clustering with the most recently obtained strains
(Fig. 1b).

In the coronavirus genome, transcription-regulating sequences
(TRSs) are present at the 3' end of the leader sequence and
upstream of each structural gene. For both clinical isolates,
the TRS has the core structure UCUCAACU, except for the ORF4
gene, for which the TRS is UCAACU. The TRS core structure of
the M and N genes differs by 1 nt (UCUAAACU), as is also
found in the reference sequence. All are exactly the same as
in the reference sequence (Volker Thiel, personal
communication), including the positions of the TRSs and their
adjacent AUGs (Table 2).
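Locating a TRS core and its adjacent AUG in a genome sequence can be sketched as follows. The function name `find_trs` is hypothetical, and reporting the first AUG downstream of each core match is a simplifying assumption: in a real genome that AUG is not guaranteed to be the authentic start codon, and core matches may also occur outside genuine TRSs.

```python
def find_trs(genome, core="UCUCAACU"):
    """Find TRS core matches and the first AUG downstream of each.

    Returns 1-based (core_start, aug_start) tuples; aug_start is None
    when no AUG follows the core. DNA input (T) is converted to RNA (U).
    """
    seq = genome.upper().replace("T", "U")
    hits = []
    pos = seq.find(core)
    while pos != -1:
        aug = seq.find("AUG", pos + len(core))
        hits.append((pos + 1, aug + 1 if aug != -1 else None))
        pos = seq.find(core, pos + 1)
    return hits


# Toy sequence containing the S-gene TRS from Table 2 with flanking bases.
print(find_trs("GGUCUCAACUAAAUAAAAUGCC"))
```

Calling the function again with `core="UCUAAACU"` or `core="UCAACU"` would cover the variant cores described for the M, N, and ORF4 genes.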

Evaluation of the 3'UTR reveals significant variation. In the
clinical isolate 0349, 8 nt substitutions and three deletions
are observed. In J0304, there are nine substitutions and two
deletions. The largest deletion in both isolates is a 38 nt
deletion. Furthermore, two short deletions of 2 and 4 nt are
observed in 0349, and a 4 nt deletion in J0304. The effect of
such deletions on 3'UTR structure and function is unknown.
For the betacoronaviruses MHV and BCoV, it has been proposed
that there are two conserved RNA structures at the upstream
end of the 3'UTR: a bulged stem-loop and an adjacent
pseudoknot [22]. There is a highly conserved pseudoknot in
all alphacoronaviruses but no detectable counterpart of the
bulged stem-loop in any proximity, upstream or downstream of
the pseudoknot [22, 23]. Strain 0349 showed most
substitutions (7/8) and deletions in the first 150 nt of the
3'UTR. Surprisingly, the downstream remainder of the 3'UTR,
in which we observe only 1 substitution, is labeled the
hyper-variable region (HVR). The HVR is marked as highly
divergent in sequence and structure, even among closely
related coronaviruses [24]. The HVR harbors the
octanucleotide 5'-GGAAGAGC-3', which is conserved in all
coronavirus 3'UTRs and is situated around 70-80 nt from the
3' end of the genome [25]. In HCoV-229E strains 0349 and
J0304, this octanucleotide is also conserved, at 73 nt from
the 3' end.
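Checking for the conserved octanucleotide and its distance from the 3' end is a simple string search, sketched below. The function name `nt_downstream_of_motif` is hypothetical, and the counting convention (number of nucleotides between the last base of the motif and the end of the supplied sequence) is an assumption; the value obtained also depends on whether the poly(A) tail is included in the sequence.

```python
def nt_downstream_of_motif(genome, motif="GGAAGAGC"):
    """Count nucleotides between the last motif occurrence and the 3' end.

    Returns None when the motif is absent. RNA input (U) is converted
    to DNA (T) before searching.
    """
    seq = genome.upper().replace("U", "T")
    start = seq.rfind(motif.upper().replace("U", "T"))
    if start == -1:
        return None
    return len(seq) - (start + len(motif))


# Toy sequence: four nucleotides follow the motif.
print(nt_downstream_of_motif("AAAGGAAGAGCTTTT"))
```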

In this study, we report the first full-genome sequences of
two non-laboratory-adapted HCoV-229E strains. Alignment of
nucleotide and protein sequences and phylogenetic analysis of
the two HCoV-229E strains showed several differences from the
reference sequence. Genetic drift was noticed in the spike
gene, and the only part of the genome truly affected by lab
adaptation is the ORF4 gene.